K-means Clustering

Clusters data by trying to separate samples into n groups of equal variance, minimizing the within-cluster sum of squares (inertia).

Documentation

Configuration:

  • n_clusters

    The number of clusters to form as well as the number of centroids to generate.

  • n_init

    Number of times the k-means algorithm will be run with different centroid seeds. The final result is the best of the n_init consecutive runs, measured by inertia.

  • init

    Method for initialization:

    ‘k-means++’: selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

    ‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.

    If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

    If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.

  • algorithm

    K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation uses the triangle inequality to be more efficient on data with well-defined clusters. However, it is more memory-intensive because it allocates an extra array of shape (n_samples, n_clusters).

    For now, “auto” (kept for backward compatibility) chooses “elkan”, but this might change in the future in favor of a better heuristic.

    Changed in version 0.18: Added Elkan algorithm

  • max_iter

    Maximum number of iterations of the k-means algorithm for a single run.

  • tol

    Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations, used to declare convergence.

  • precompute_distances

    Precompute distances (faster but takes more memory).

    ‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100 MB of overhead per job using double precision.

    True : always precompute distances.

    False : never precompute distances.

    Deprecated since version 0.23: ‘precompute_distances’ was deprecated in version 0.23 and will be removed in 0.25. It has no effect.

  • n_jobs

    The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main Cython loop, which assigns each sample to its closest center.

    None or -1 means using all processors.

    Deprecated since version 0.23: n_jobs was deprecated in version 0.23 and will be removed in 0.25.

  • random_state

    Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See random_state.
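
The options above read like the parameters of scikit-learn's KMeans estimator, so the following is a minimal sketch of how they might be set when calling that estimator directly. It assumes the node forwards its configuration to sklearn.cluster.KMeans unchanged, which this page does not state explicitly. The synthetic data and the variable names are illustrative only, and the deprecated options (precompute_distances, n_jobs) are left out on purpose.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated 2-D blobs (synthetic data, for illustration only).
rng = np.random.RandomState(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# Variant 1: k-means++ seeding, 10 restarts, fixed seed for reproducibility.
km = KMeans(
    n_clusters=3,
    init="k-means++",
    n_init=10,
    max_iter=300,
    tol=1e-4,
    random_state=0,
).fit(X)

# Variant 2: explicit initial centers of shape (n_clusters, n_features).
# A single run is enough because the starting point is fixed.
km_fixed = KMeans(n_clusters=3, init=blob_centers, n_init=1).fit(X)
```

On data this cleanly separated, both variants would be expected to end at the same partition; on harder data, a larger n_init trades run time for a better chance of finding a low-inertia solution.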

Attributes:

  • cluster_centers_

    Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

  • labels_

    Label (cluster index) of each point.

  • inertia_

    Sum of squared distances of samples to their closest cluster center.
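
To make the relationship between the three attributes concrete, here is a small pure-Python sketch that derives labels and inertia from a given set of centers. It follows the definitions above and is not the library's implementation; the helper names assign_labels and inertia are hypothetical and exist only for this illustration.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two points."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def assign_labels(points, centers):
    """Return, for each point, the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
        for p in points
    ]


def inertia(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[k]) for p, k in zip(points, labels))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

labels = assign_labels(points, centers)
print(labels)                            # [0, 0, 1, 1]
print(inertia(points, centers, labels))  # 4.0
```

Each point lies at squared distance 1 from its nearest center, so the inertia is 4.0. If the centers were updated after the labels were last computed (for example when max_iter is reached), recomputing the labels this way could give a different assignment, which is the inconsistency mentioned under cluster_centers_.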

Input ports:

Output ports:

  • model (model)

    Model

Definition

class node_clustering.KMeansClustering